Description of the dataset
Citation Request: This dataset is publicly available for research. The details are described in [Cortez et al., 2009]. Please include this citation if you plan to use this database:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib
Title: Wine Quality
Sources
Created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009
Past Usage:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
In the above reference, two datasets were created, using red and white wine samples. The inputs include objective tests (e.g. pH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). Several data mining methods were applied to model these datasets under a regression approach. The support vector machine model achieved the best results. Several metrics were computed: MAD, confusion matrix for a fixed error tolerance (T), etc. Also, we plot the relative importance of the input variables (as measured by a sensitivity analysis procedure).
Relevant Information:
The two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant, so it could be interesting to test feature selection methods.
Number of Instances: red wine - 1599; white wine - 4898.
Number of Attributes: 11 + output attribute
Note: several of the attributes may be correlated, thus it makes sense to apply some sort of feature selection.
Attribute information:
For more information, read [Cortez et al., 2009].
Input variables (based on physicochemical tests):
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
Missing Attribute Values: None
Description of attributes:
1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity: the amount of acetic acid in wine, which at high levels can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet
5 - chlorides: the amount of salt in the wine
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 - density: the density of wine is close to that of water depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11 - alcohol: the percent alcohol content of the wine
12 - quality (score between 0 and 10)
## 'data.frame': 1599 obs. of 15 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ quality.bucket : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
## $ total.acidity : num 8.1 8.68 8.6 12.04 8.1 ...
The dataset contains 1599 evaluated wines. Each wine is described by 12 variables (11 continuous and 1 discrete). Two extra variables are added to the dataset:
- quality.bucket = categorical variable for quality
- total.acidity = fixed.acidity + volatile.acidity + citric.acid
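The two derived variables can be reproduced in a few lines of R. The following is a minimal sketch, not necessarily the code used for this report: it rebuilds only the first ten rows printed in the str() output above, and the data frame name `wine` is chosen here for illustration.

```r
# First ten rows as printed in the str() output; the full data set has 1599 rows.
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Categorical copy of the quality score; levels 3 to 8 give the six levels
# reported by str().
wine$quality.bucket <- factor(wine$quality, levels = 3:8)

# Sum of the three acidity measurements (all in g / dm^3).
wine$total.acidity <- with(wine, fixed.acidity + volatile.acidity + citric.acid)

print(wine[, c("quality", "quality.bucket", "total.acidity")])
```

The first four values of total.acidity (8.1, 8.68, 8.6, 12.04) agree with the str() output, which suggests the definition above is the one that was used.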
Univariate Plots Section

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
Quality scores of the red wines are approximately normally distributed, ranging from 3 (poor) to 8 (excellent).
The median score is 6 and the mean is slightly lower at 5.6.
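How the scores spread over the six observed levels can be inspected with a frequency table. This is a small sketch using only the ten quality values printed in the str() output, so the counts are illustrative and do not describe the full 1599 wines; the variable names are assumptions.

```r
# Quality scores of the first ten wines (from the str() output).
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Fixing the levels to 3..8 keeps empty classes visible in the table.
counts <- table(factor(quality, levels = 3:8))
print(counts)

# Share of each score, rounded to two decimals.
print(round(prop.table(counts), 2))
```

Keeping the empty levels in the table makes class imbalance easy to spot when the same call is run on the full data set.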

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
Fixed acidity ranges from 4.6 g / dm^3 to 15.9 g / dm^3.
The distribution is positively skewed (median 7.9 g / dm^3, mean 8.32 g / dm^3).
Fixed acidity seems to have a weak influence on wine rating.
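A mean above the median is the informal sign of positive skew used throughout this section. A moment-based skewness coefficient gives a number to go with it. The sketch below is an illustration on the ten fixed acidity values printed in the str() output; the helper `skewness` is defined here and is not part of base R.

```r
# Fixed acidity of the first ten wines (g / dm^3), from the str() output.
fixed.acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)

# Moment-based skewness: third central moment divided by variance^(3/2).
# Positive values indicate a longer right tail.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

stats <- c(
  mean     = mean(fixed.acidity),
  median   = median(fixed.acidity),
  skewness = skewness(fixed.acidity)
)
print(round(stats, 3))
```

On these ten values the mean (7.95) exceeds the median (7.65) and the coefficient is positive, driven by the single value of 11.2.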

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
Volatile acidity ranges from 0.12 g / dm^3 to 1.58 g / dm^3.
The distribution is close to normal (median 0.52 g / dm^3, mean 0.5278 g / dm^3).
There is one outlier with value 1.58 g / dm^3.
Wines with good rating (7 or 8) tend to have less volatile acidity than the average wine.
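Statements of the form "good wines have less X than the average wine" can be checked by comparing group means. The sketch below shows one way to do it with `tapply` on the ten rows printed in the str() output; ten rows are far too few to support the conclusion, so it only illustrates the mechanics, and the object names are assumptions.

```r
# Volatile acidity (g / dm^3) and quality of the first ten wines.
wine <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Mean volatile acidity for each quality score present in the sample.
by_quality <- tapply(wine$volatile.acidity, wine$quality, mean)
print(round(by_quality, 3))

# Mean for wines rated 7 or 8 versus the mean over all wines.
good_mean    <- mean(wine$volatile.acidity[wine$quality >= 7])
overall_mean <- mean(wine$volatile.acidity)
print(c(good = good_mean, overall = overall_mean))
```

The same pattern applies to citric acid, sulphates and alcohol by swapping the column.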

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
Citric acid ranges from 0 g / dm^3 to 1 g / dm^3.
The distribution is positively skewed (median 0.26 g / dm^3, mean 0.271 g / dm^3).
It has several modes, at 0 g / dm^3, 0.25 g / dm^3 and 0.5 g / dm^3.
There is one outlier with value 1 g / dm^3.
Wines with good rating (7 or 8) tend to have more citric acid than the average wine.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.270 7.827 8.720 9.118 10.070 17.045
Total acidity ranges from 5.27 g / dm^3 to 17.045 g / dm^3.
The distribution is positively skewed (median 8.72 g / dm^3, mean 9.118 g / dm^3).
There are some outliers with values above 15 g / dm^3.
Total acidity seems to have a weak influence on wine rating.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
Residual sugar ranges from 0.9 g / dm^3 to 15.5 g / dm^3.
There are no sweet wines in the dataset (sweet wines have residual sugar above 45 g / dm^3).
The distribution is close to normal (median 2.2 g / dm^3, mean 2.539 g / dm^3).
There are some outliers with values above 4 g / dm^3.
Residual sugar seems to have a weak influence on wine rating.
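The report does not say how outlier thresholds were chosen; they were presumably read off the plots. One common convention, used here purely as an assumption, is Tukey's rule: values beyond the third quartile plus 1.5 times the interquartile range are flagged. Applied to the residual sugar quartiles printed above, it gives a cutoff close to the 4 g / dm^3 mentioned in the text.

```r
# Quartiles of residual sugar (g / dm^3), copied from the summary output.
q1 <- 1.9
q3 <- 2.6

# Tukey's rule (an assumed convention, not the report's stated method):
# flag values above Q3 + 1.5 * IQR.
iqr         <- q3 - q1
upper_fence <- q3 + 1.5 * iqr
print(upper_fence)

# The reported maximum of 15.5 g / dm^3 lies well above this fence.
print(15.5 > upper_fence)
```

The same arithmetic can be repeated for any variable in this section from its printed quartiles.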

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Chlorides range from 0.012 g / dm^3 to 0.611 g / dm^3.
The distribution is close to normal (median 0.079 g / dm^3, mean 0.08747 g / dm^3).
There are a few outliers with values above 0.3 g / dm^3.
Chlorides seem to have a weak influence on wine rating.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
Free sulfur dioxide ranges from 1 mg / dm^3 to 72 mg / dm^3.
The distribution is positively skewed (median 14 mg / dm^3, mean 15.87 mg / dm^3).
Free sulfur dioxide seems to have a weak influence on wine rating.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
Total sulfur dioxide ranges from 6 mg / dm^3 to 289 mg / dm^3.
The distribution is positively skewed (median 38 mg / dm^3, mean 46.47 mg / dm^3).
There are a few outliers with values above 200 mg / dm^3.
Total sulfur dioxide seems to have a weak influence on wine rating.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
Density ranges from 0.9901 g / cm^3 to 1.0037 g / cm^3.
The distribution is close to normal (median 0.9968 g / cm^3, mean 0.9967 g / cm^3).
Density seems to have a weak influence on wine rating.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
pH ranges from 2.74 to 4.01.
The distribution is close to normal (median 3.31, mean 3.311).
pH seems to have a weak influence on wine rating.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Sulphates range from 0.33 g / dm^3 to 2 g / dm^3.
The distribution is positively skewed (median 0.62 g / dm^3, mean 0.6581 g / dm^3).
There are a few outliers with values above 1.5 g / dm^3.
Wines with good rating (7 or 8) tend to have slightly more sulphates than the average wine.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
Alcohol ranges from 8.4 % to 14.9 %.
The distribution is positively skewed (median 10.2 %, mean 10.42 %).
There are a few outliers with values above 14 %.
Wines with good rating (7 or 8) tend to have more alcohol than the average wine.
Bivariate Plots Section
Pairplots

The matrix plot helps to better understand the correlations between the variables of the dataset.
It confirms that quality is mainly correlated with alcohol, sulphates, citric acid (positively) and volatile acidity (negatively).
Several links between supporting variables are also identified:
- pH and acidity (which is quite obvious)
- density with alcohol, acidity and residual sugar
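The numbers behind a matrix plot can be obtained with `cor()`. The sketch below is a hedged illustration on the ten rows printed in the str() output, restricted to the four inputs named above plus quality; with so few rows the coefficients will differ from those of the full data set, and the object names are assumptions.

```r
# First ten rows of selected columns, copied from the str() output.
wine <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Pairwise Pearson correlations between all columns.
cor_mat <- cor(wine, method = "pearson")
print(round(cor_mat, 2))

# Correlation of each input with quality, strongest first.
q <- cor_mat["quality", colnames(cor_mat) != "quality"]
print(round(q[order(abs(q), decreasing = TRUE)], 2))
```

Sorting by absolute value puts strong negative correlations, such as the one reported for volatile acidity, next to strong positive ones.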
Other bivariate plots

##
## Pearson's product-moment correlation
##
## data: wineQualityReds$quality and wineQualityReds$alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
The bivariate plot of quality vs alcohol confirms our intuition and shows a clear positive correlation between the two variables.
The Pearson correlation coefficient between alcohol and quality is 0.48.
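The t statistic and confidence interval in the output above follow directly from the correlation coefficient and the sample size. As a worked check, the sketch below recomputes them from the printed values: t = r * sqrt(df) / sqrt(1 - r^2), and the interval uses Fisher's z transform, which is what `cor.test` documents for Pearson correlations. The variable names are chosen here for illustration.

```r
# Values copied from the cor.test output for quality vs alcohol.
r  <- 0.4761663
df <- 1597          # n - 2
n  <- df + 2

# t statistic for testing that the true correlation is zero.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
print(round(t_stat, 3))

# 95 % confidence interval via Fisher's z transform:
# z = atanh(r), standard error 1 / sqrt(n - 3), back-transformed with tanh.
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)
print(round(ci, 4))
```

Both results agree with the printed output (t = 21.639, interval 0.437 to 0.513) up to rounding.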

##
## Pearson's product-moment correlation
##
## data: wineQualityReds$quality and log10(wineQualityReds$sulphates)
## t = 12.967, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2636092 0.3523323
## sample estimates:
## cor
## 0.3086419
The bivariate plot of quality vs log10(sulphates) also shows a positive correlation.
The Pearson correlation coefficient between log10(sulphates) and quality is 0.31.
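Sulphates are positively skewed, which is the likely reason the correlation is computed on a log10 scale. The call below sketches how such a test can be run; it uses only the ten rows printed in the str() output, so its estimate will not match the 0.31 obtained on the full data set, and the variable names are assumptions.

```r
# Sulphates (g / dm^3) and quality of the first ten wines, from the str() output.
sulphates <- c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80)
quality   <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Pearson correlation test on the log10-transformed input.
res <- cor.test(quality, log10(sulphates), method = "pearson")

print(res$estimate)   # sample correlation
print(res$parameter)  # degrees of freedom, n - 2
print(res$conf.int)   # 95 % confidence interval
```

With ten observations the degrees of freedom are 8, compared with 1597 in the output above.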

##
## Pearson's product-moment correlation
##
## data: wineQualityReds$quality and wineQualityReds$citric.acid
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
The bivariate plot of quality vs citric.acid shows a weak positive correlation.
The Pearson correlation coefficient between citric.acid and quality is 0.23.

##
## Pearson's product-moment correlation
##
## data: wineQualityReds$quality and wineQualityReds$volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
The bivariate plot of quality vs volatile.acidity shows a negative correlation.
The Pearson correlation coefficient between volatile.acidity and quality is -0.39.

##
## Pearson's product-moment correlation
##
## data: wineQualityReds$density and wineQualityReds$alcohol
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5322547 -0.4583061
## sample estimates:
## cor
## -0.4961798
##
## Pearson's product-moment correlation
##
## data: wineQualityReds$density and wineQualityReds$fixed.acidity
## t = 35.877, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6399847 0.6943302
## sample estimates:
## cor
## 0.6680473
##
## Pearson's product-moment correlation
##
## data: wineQualityReds$density and log10(wineQualityReds$residual.sugar)
## t = 18.363, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3762175 0.4572012
## sample estimates:
## cor
## 0.4175381
##
## Pearson's product-moment correlation
##
## data: wineQualityReds$density and wineQualityReds$pH
## t = -14.53, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3842835 -0.2976642
## sample estimates:
## cor
## -0.3416993
There are clear correlations between wine density and alcohol (-0.5), fixed acidity (0.67), residual sugar (0.42 with log10(residual.sugar)) and pH (-0.34).
The strongest correlation is with fixed acidity, which was not obvious from a physical standpoint (the correlation with alcohol was easier to infer).